ECE 6605 - Information Theory

Prof. Matthieu Bloch

Thursday, August 31, 2023

Announcements

  • Assignment 1
    • Check Piazza for Q&A
    • Come to office hours on Friday September 01, 2023 11 am in TSRB 423
    • You can work on Problem 1 already, but all others will make a lot more sense after today
  • Errata
    • Keep me honest
    • Strict convexity of \(f\) required for \(\mathbb{D}_f(P\Vert Q)=0\Rightarrow P=Q\)
    • Lecture notes coming
  • Last time
    • Jensen's inequality
    • \(f\)-divergence
  • Today
    • Shannon entropy
    • Mutual information
    • Fano's inequality
    • Rényi entropy

\(f\)-divergence

Let \(f:\bbR\to\bbR\) be a convex function such that \(f(1)=0\). The \(f\)-divergence between two distributions \(P,Q \in\Delta(\calX)\) is

\[\begin{align*} \bbD_f(P\Vert Q)\eqdef \sum_{x\in\calX}f\left(\frac{P(x)}{Q(x)}\right)Q(x), \end{align*}\]

with the convention that \(\bbD_f(P\Vert Q)=\infty\) if \(P\) is not absolutely continuous with respect to \(Q\).

For any \(P,Q \in \Delta(\calX)\), \(\bbD_f(P\Vert Q)\geq 0\). If \(f\) is strictly convex, equality holds if and only if \(P=Q\).
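
A minimal Python sketch of this definition (illustration only: the function name `f_divergence`, the choice of \(f\), and the example distributions are not from the lecture):

```python
import numpy as np

def f_divergence(f, P, Q):
    """D_f(P||Q) = sum_x f(P(x)/Q(x)) Q(x), with the absolute-continuity convention."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    if np.any((Q == 0) & (P > 0)):     # P not absolutely continuous w.r.t. Q
        return np.inf
    m = Q > 0                          # terms with P(x) = Q(x) = 0 contribute nothing
    return np.sum(f(P[m] / Q[m]) * Q[m])

# f(t) = t ln t (with 0 ln 0 = 0) recovers the relative entropy discussed below.
f_kl = lambda t: np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)

P = np.array([0.5, 0.5, 0.0])
Q = np.array([0.25, 0.25, 0.5])
print(f_divergence(f_kl, P, Q))        # ln 2 ~ 0.693, and always >= 0
```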

A function \(W:\calX\times\calY\to\bbR:(x,y)\mapsto W(y|x)\) is a Markov kernel (also called a transition probability) from \(\calX\) to \(\calY\) if \(W(y|x)\geq 0\) for all \((x,y)\in\calX\times\calY\) and \(\sum_{y\in\calY}W(y|x)=1\) for all \(x\in\calX\).

  • Note: this notation can be confusing at first; you can write \(P_{Y|X}\) if it's easier.
    • This notation allows us to write \(W\cdot P\) for the joint distribution and \(W\circ P\) for the output marginal (see the sketch below).
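
Concretely, a kernel \(W\) can be stored as a row-stochastic matrix; the sketch below (illustrative values, not from the lecture) forms the joint \(W\cdot P\) and the output marginal \(W\circ P\):

```python
import numpy as np

# Markov kernel W from X (rows) to Y (columns): every row sums to 1.
W = np.array([[0.9, 0.1],
              [0.2, 0.8]])
P = np.array([0.5, 0.5])       # input distribution P on X

joint = P[:, None] * W         # (W . P)(x, y) = P(x) W(y|x), a distribution on X x Y
marginal = P @ W               # (W o P)(y) = sum_x P(x) W(y|x), the output marginal on Y

print(joint.sum(), marginal)   # 1.0 [0.55 0.45]
```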

More properties of \(f\)-divergence

Let \(P,Q\in\Delta(\calX)\) and let \(W\) be a Markov kernel from \(\calX\) to \(\calY\).

\[\begin{align*} \bbD_f(W\circ P\Vert W\circ Q) \leq \bbD_f(W\cdot P\Vert W\cdot Q) = \bbD_f(P\Vert Q). \end{align*}\]
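
A quick numerical sanity check of the data-processing part of this statement, instantiated for the relative entropy and the matrix conventions above (illustration only):

```python
import numpy as np

def kl(P, Q):
    """Relative entropy D(P||Q) in nats, assuming P is absolutely continuous w.r.t. Q."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    m = P > 0
    return np.sum(P[m] * np.log(P[m] / Q[m]))

W = np.array([[0.9, 0.1],
              [0.2, 0.8]])     # Markov kernel from X to Y
P = np.array([0.7, 0.3])
Q = np.array([0.4, 0.6])

# D(W o P || W o Q) <= D(P || Q): passing P and Q through W cannot increase divergence.
print(kl(P @ W, Q @ W), "<=", kl(P, Q))   # ~0.090 <= ~0.184
```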

The \(f\)-divergence \(\bbD_f:\Delta(\calX)\times\Delta(\calX)\to\bbR^+:(P,Q)\mapsto\mathbb{D}_f(P\Vert Q)\) is jointly convex, i.e., for all \(P_1,Q_1,P_2,Q_2\in\Delta(\calX)\) and \(\lambda\in[0;1]\)

\[\begin{align*} \mathbb{D}_f\left(\lambda P_1+(1-\lambda) P_2\Vert \lambda Q_1+(1-\lambda) Q_2\right)\leq \lambda\mathbb{D}_f(P_1\Vert Q_1)+(1-\lambda)\mathbb{D}_f(P_2\Vert Q_2). \end{align*}\]

Special cases of \(f\)-divergence

For \(P,Q\in\Delta(\calX)\), the total variation between \(P\) and \(Q\) is the \(f\)-divergence for \(f:t\mapsto\abs{t-1}\), i.e.,

\[\begin{align} \mathbb{V}\left(P,Q\right)\eqdef \frac{1}{2}\norm[1]{P-Q}\eqdef \frac{1}{2}\sum_{x\in\calX}\abs{P(x)-Q(x)}. \end{align}\]

For \(P,Q\in\Delta(\calX)\), the relative entropy, also called Kullback-Leibler divergence, between \(P\) and \(Q\) is the \(f\)-divergence for \(f:t\mapsto t\ln t\), i.e.,

\[\begin{align} \bbD(P\Vert Q)\eqdef \sum_{x\in\calX}P(x)\ln\frac{P(x)}{Q(x)}, \end{align}\]

with the convention that \(\bbD(P\Vert Q) = \infty\) if \(P\) is not absolutely continuous with respect to \(Q\).
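
Both special cases are straightforward to compute directly; a minimal sketch (the function names below are illustrative):

```python
import numpy as np

def total_variation(P, Q):
    """V(P, Q) = (1/2) sum_x |P(x) - Q(x)|."""
    return 0.5 * np.abs(np.asarray(P, float) - np.asarray(Q, float)).sum()

def relative_entropy(P, Q):
    """D(P||Q) in nats; infinite if P is not absolutely continuous w.r.t. Q."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    if np.any((Q == 0) & (P > 0)):
        return np.inf
    m = P > 0
    return np.sum(P[m] * np.log(P[m] / Q[m]))

P = np.array([0.5, 0.5])
Q = np.array([0.9, 0.1])
print(total_variation(P, Q), relative_entropy(P, Q))   # 0.4 and ~0.511
```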

Entropy

  • Axiomatic construction of entropy H for \(P_X\in\Delta(\calX)\)
    1. \(H(X)\) only depends on the set of values \(\{p_X(x)\}_{x\in\calX}\) but not on their order.
    2. for \(X\sim B(p)\) with \(p\in[0;1]\), \(H(X)\) is a continuous function of \(p\).
    3. if \(X\sim B(1/2)\), then \(H(X)=1\).
    4. If \(t\in[0;1]\) and \(\set{p_i}_{i=1}^n\in\Delta(\calX)\), then \(H(tp_1,(1-t)p_1,p_2,\cdots,p_n)=p_1H(t,1-t)+ H(p_1,p_2,\cdots,p_n)\)

The only function that satisfies the axioms is \(H(X) = -\sum_xP_X(x)\log_2 P_X(x)\).

Let \(X\in\calX\) be a discrete random variable with \(\card{\calX}<\infty\). The Shannon entropy of \(X\) is defined as

\[\begin{align} \mathbb{H}(X) \eqdef \E[X]{-\log_2 P_X(X)} = - \sum_{x \in {\calX}} P_X(x) \log_2 P_X(x), \end{align}\]

with the convention that \(0\log_2 0 = 0\).
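
In code, the definition reads as follows; a small sketch in bits, handling the convention by dropping zero-probability symbols (illustration only):

```python
import numpy as np

def shannon_entropy(P):
    """H(X) = -sum_x P_X(x) log2 P_X(x), with the convention 0 log2 0 = 0."""
    P = np.asarray(P, float)
    P = P[P > 0]                             # 0 log2 0 = 0: zero-probability symbols drop out
    return -np.sum(P * np.log2(P))

print(shannon_entropy([0.5, 0.5]))           # 1.0 bit, matching axiom 3
print(shannon_entropy([0.25] * 4))           # 2.0 bits (uniform over 4 symbols)
print(shannon_entropy([1.0, 0.0, 0.0]))      # 0.0 bits (constant random variable)
```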

Properties of entropy

The entropy \(H:\Delta(\calX)\to\bbR:P\mapsto H(P)\) is a concave function.

Let \(X\in\calX\) be a discrete random variable. Then \(\mathbb{H}(X) \geq 0\) with equality iff \(X\) is a constant, i.e., there exists \(x^*\in\calX\) such that \(\P{X=x^*}=1\).

Let \(X \in {\calX}\) be a discrete random variable. Then \(\mathbb{H}(X)\leq \log_2 \card{\calX}\) with equality if and only if \(X\) is uniform over \({\calX}\), i.e., \(\forall x \in {\calX}, P_X(x) = \frac{1}{\card{\calX}}\).

Joint and conditional entropy

The joint entropy of two discrete random variables \(X\in\calX\) and \(Y\in\calY\) with joint PMF \(P_{XY}\) is

\[\begin{align*} \mathbb{H}(X,Y) \eqdef \mathbb{E}_{XY}(-\log_2 P_{XY}(X,Y)) = - \sum_{x \in {\calX}} \sum_{y \in \mathcal{Y}}P_{XY}(x, y) \log_2 P_{XY}(x, y). \end{align*}\]

Furthermore, the conditional entropy of \(Y\) given \(X\) is

\[\begin{align*} \mathbb{H}(Y|X) \eqdef \mathbb{E}_{XY}(-\log_2 P_{Y|X}(Y|X)) = - \sum_{x \in {\calX}} \sum_{y \in \mathcal{Y}} P_{XY}(x,y) \log_2 P_{Y|X}(y|x). \end{align*}\]

Let \(X, Y\) be discrete random variables with joint PMF \(p_{XY}\). Then \(\mathbb{H}(Y|X) \geq 0\) with equality if and only if \(Y\) is a function of \(X\).

Let \(X\) and \(Y\) be discrete random variables with joint PMF \(p_{XY}\). Then \(\mathbb{H}(X|Y) \leq \mathbb{H}(X)\) with equality if and only if \(X\) is independent of \(Y\).
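
A minimal sketch computing \(\mathbb{H}(X,Y)\), \(\mathbb{H}(Y|X)\), and \(\mathbb{H}(X|Y)\) from a joint PMF stored as a matrix (illustration only; function names are not from the lecture):

```python
import numpy as np

def joint_entropy(Pxy):
    """H(X,Y) in bits, from a joint PMF given as a matrix Pxy[x, y]."""
    p = np.asarray(Pxy, float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def conditional_entropy(Pxy, axis=0):
    """H(Y|X) if axis=0 (condition on X), H(X|Y) if axis=1 (condition on Y)."""
    Pxy = np.asarray(Pxy, float)
    marginal = Pxy.sum(axis=1 - axis)        # PMF of the conditioning variable
    p = marginal[marginal > 0]
    return joint_entropy(Pxy) + np.sum(p * np.log2(p))   # H(.|.) = H(X,Y) - H(conditioning)

Pxy = np.array([[0.25, 0.25],
                [0.0,  0.5 ]])
print(joint_entropy(Pxy))                    # 1.5 bits
print(conditional_entropy(Pxy, axis=0))      # H(Y|X) = 0.5 bit
print(conditional_entropy(Pxy, axis=1))      # H(X|Y) ~ 0.689 bit <= H(X) = 1 bit
```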

Chain rule of entropy

Let \(X\), \(Y\), and \(Z\) be discrete random variables with joint PMF \(p_{XYZ}\). Then \[ \mathbb{H}(XY|Z) = \mathbb{H}(X|Z) + \mathbb{H}(Y|XZ) = \mathbb{H}(Y|Z) + \mathbb{H}(X|YZ). \] More generally, if \(\bfX\eqdef\set{X_i}_{i=1}^n\) and \(Z\) are jointly distributed random variables, we have \[ \mathbb{H}(\bfX|Z) = \sum_{i=1}^n\mathbb{H}(X_i|X^{1:i-1}Z) \] with the convention \(X^{1:0}\eqdef \emptyset\).
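
A quick numerical check of the two-variable chain rule against the definitions, using a randomly generated joint PMF (illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
Pxyz = rng.random((2, 2, 2))
Pxyz /= Pxyz.sum()                           # random joint PMF P_{XYZ}[x, y, z]

def H(p):
    """Entropy in bits of a PMF given as an array of any shape."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Left-hand side: H(XY|Z) from its definition E[-log2 P_{XY|Z}(X,Y|Z)]
Pz = Pxyz.sum(axis=(0, 1))
lhs = -np.sum(Pxyz * np.log2(Pxyz / Pz[None, None, :]))

# Right-hand side: H(X|Z) + H(Y|XZ), each written as a difference of joint entropies
Pxz = Pxyz.sum(axis=1)
rhs = (H(Pxz) - H(Pz)) + (H(Pxyz) - H(Pxz))

print(np.isclose(lhs, rhs))                  # True
```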

Mutual information

Let \(X, Y\) be two random variables with joint PMF \(P_{XY}\). The mutual information between \(X\) and \(Y\) is \[ \bbI(X;Y)\eqdef \bbH(X)-\bbH(X|Y)=\bbD(P_{XY}\Vert P_XP_Y). \]

If \(\Delta(\calX\to\calY)\) denotes the set of Markov kernels from \(\calX\) to \(\calY\), then the function \[ I:\Delta(\calX)\times \Delta(\calX\to\calY)\to\bbR^+:(P,W)\mapsto I(P;W)\eqdef \mathbb{D}(W\cdot P\Vert (W\circ P)P) \] is a concave function of \(P\) and a convex function of \(W\).
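
The two expressions for \(\bbI(X;Y)\) in the definition above can be checked numerically; a minimal sketch assuming the joint PMF is given as a matrix (illustration only):

```python
import numpy as np

def H(p):
    """Entropy in bits of a PMF given as an array of any shape."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

Pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])                 # joint PMF P_{XY}[x, y]
Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)

I_entropies = H(Px) - (H(Pxy) - H(Py))                        # H(X) - H(X|Y)
I_divergence = np.sum(Pxy * np.log2(Pxy / np.outer(Px, Py)))  # D(P_XY || P_X P_Y) in bits

print(np.isclose(I_entropies, I_divergence), I_entropies)     # True, ~0.278 bits
```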

Conditional mutual information

Let \(X, Y, Z\) be discrete random variables with joint distribution \(p_{XYZ}\). The conditional mutual information between \(X\) and \(Y\) given \(Z\) is \(\mathbb{I}(X ; Y|Z) \eqdef \mathbb{H}(X|Z) - \mathbb{H}(X|YZ)\).

  • Using Bayes' rule, \(\mathbb{I}\left(X;Y|Z\right) = \E[Z]{\mathbb{D}(P_{XY|Z}\Vert P_{X|Z}P_{Y|Z})}\).

\(\mathbb{I}(X;Y|Z)\geq 0\) with equality if and only if \(X\) and \(Y\) are conditionally independent given \(Z\).

Let \(\bfX\eqdef\set{X_i}_{i=1}^n\), \(Y\), and \(Z\) be jointly distributed random variables. Then,

\[\begin{align*} \mathbb{I}(\bfX;Y|Z) = \sum_{i=1}^n\mathbb{I}(X_i;Y|Z\bfX^{1:i-1})\textsf{ with the convention }\bfX^{1:0}=\emptyset. \end{align*}\]

Let \(X\), \(Y\), and \(Z\) be discrete random variables forming a Markov chain \(X \rightarrow Y \rightarrow Z\). Then \(\mathbb{I}(X;Y) \geq \mathbb{I}(X;Z)\) or, equivalently, \(\mathbb{H}(X|Z) \geq \mathbb{H}(X|Y)\).

Fano's inequality

Let \(X\) be a discrete random variable with alphabet \({\calX}\), and let \(\widehat{X} \in {\calX}\) be an estimate of \(X\) with joint distribution \(p_{X\widehat{X}}\). We define the probability of estimation error \(P_e \eqdef \mathbb{P}[X\neq\widehat{X}]\). Then, \(\mathbb{H}(X|\widehat{X}) \leq \mathbb{H}_b(P_e) + P_e \log_2(\card{\calX}-1)\).
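
A quick numerical check of Fano's inequality on a randomly generated joint PMF of \((X,\widehat{X})\), in bits (illustration only; the bound holds for any joint distribution):

```python
import numpy as np

def H(p):
    """Entropy in bits of a PMF given as an array of any shape."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def Hb(p):
    """Binary entropy H_b(p) in bits."""
    return H([p, 1 - p])

rng = np.random.default_rng(1)
P = rng.random((4, 4))
P /= P.sum()                                 # random joint PMF of (X, Xhat), 4-symbol alphabet

Pe = 1 - np.trace(P)                         # P[X != Xhat]
H_X_given_Xhat = H(P) - H(P.sum(axis=0))     # H(X|Xhat) = H(X, Xhat) - H(Xhat)

print(H_X_given_Xhat <= Hb(Pe) + Pe * np.log2(4 - 1))   # True: Fano's bound holds
```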

Rényi entropy

  • Weaker axiomatic construction of entropy H for \(P_X\in\Delta(\calX)\)
    1. \(H(X)\) only depends on the set of values \(\{p_X(x)\}_{x\in\calX}\) but not on their order.
    2. for \(X\sim B(p)\) with \(p\in[0;1]\), \(H(X)\) is a continuous function of \(p\).
    3. if \(X\sim B(1/2)\), then \(H(X)=1\).

Let \(X\in\calX\) be a discrete random variable. The Rényi entropy of order \(\alpha\in\mathbb{R}\cup\{\infty\}\) of \(X\) is

\[\begin{align*} H_{\alpha}(X)\eqdef \frac{1}{1-\alpha}\log \sum_{x\in\calX} p_X(x)^\alpha, \end{align*}\]

with the convention that \(H_1(X)=\lim_{\alpha\rightarrow 1} H_\alpha(X)\) and \(H_\infty(X)=\lim_{\alpha\rightarrow \infty} H_\alpha(X)\).

Let \(X\in\calX\) be a discrete random variable. Then,

  • \(H_0(X)=\log \card{\calX}\) is called the max-entropy;
  • \(H_1(X)\eqdef\lim_{\alpha\rightarrow 1} H_\alpha(X) = \mathbb{H}(X)\) is the Shannon entropy;
  • \(H_2(X)=-\log\sum_{x}p_X(x)^2\) is called the collision entropy;
  • \(H_\infty(X)\eqdef\lim_{\alpha\rightarrow \infty} H_\alpha(X) = -\log \max_{x}p_X(x)\) is called the min-entropy.
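
These special cases are easy to verify numerically; a minimal sketch in bits (the function name `renyi_entropy` is illustrative):

```python
import numpy as np

def renyi_entropy(P, alpha):
    """Renyi entropy of order alpha in bits, with the limiting cases handled separately."""
    P = np.asarray(P, float)
    P = P[P > 0]
    if alpha == 1:                           # Shannon entropy (limit alpha -> 1)
        return -np.sum(P * np.log2(P))
    if alpha == np.inf:                      # min-entropy (limit alpha -> infinity)
        return -np.log2(P.max())
    return np.log2(np.sum(P ** alpha)) / (1 - alpha)

P = [0.5, 0.25, 0.25]
for a in [0, 1, 2, np.inf]:
    print(a, renyi_entropy(P, a))
# H_0 = log2(3) ~ 1.585, H_1 = 1.5, H_2 ~ 1.415, H_inf = 1.0: non-increasing in alpha
```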